Protecting world leaders against deep fakes using facial, gestural, and vocal mannerisms

Since their emergence a few years ago, artificial intelligence (AI)-synthesized media—so-called deep fakes—have dramatically increased in quality, sophistication, and ease of generation. Deep fakes have been weaponized for use in nonconsensual pornography, large-scale fraud, and disinformation campaigns. Of particular concern is how deep fakes will be weaponized against world leaders during election cycles or times of armed conflict. We describe an identity-based approach for protecting world leaders from deep-fake imposters. Trained on several hours of authentic video, this approach captures distinct facial, gestural, and vocal mannerisms that we show can distinguish a world leader from an impersonator or deep-fake imposter.

In the early days of the Russian invasion of Ukraine, President Zelenskyy warned the world that Russia's digital disinformation machinery would create a deep fake of him admitting defeat and surrendering. By mid-March of 2022, a deep fake of Zelenskyy appeared with just this message (1). This video was debunked thanks to the rather crude audio and video and to Zelenskyy's prebunking, but not before it spread across social media and appeared briefly on Ukrainian television. Three months later, the mayors of Berlin, Madrid, and Vienna each held extended video-based conversations with a deep-fake version of Kyiv Mayor Klitschko.
In addition to adding jet fuel to disinformation campaigns, this new breed of synthetic media also makes it easier to deny reality, the so-called liar's dividend (2), as seen by the recent baseless claim that video addresses by President Biden are deep fakes deployed to conceal his death or incapacitation (3).
These recent events are likely just the beginning of a new assault on reality in both recorded and live videos. While we may have the ability to perceptually detect some deep-fake videos (4), this ability is not always reliable, and the task will become increasingly difficult as deep fakes continue to improve in quality and sophistication.
The computational detection of deep-fake videos can be divided into three categories: 1) learning-based, in which features to distinguish the real from fake are explicitly learned (5); 2) artifact-based, in which low-level to high-level features are explicitly designed to distinguish the real from fake (6); and 3) identity-based, in which biometric-style features are used to identify whether the person depicted in a video is who it purports to be (7).
The advantage of identity-based techniques is that they are resilient to adversarial and laundering attacks and are applicable to the many different forms of deep fakes. The disadvantage is that these techniques require an identity-specific model, generated from several hours of authentic video footage. Because we are focused on protecting world leaders, for whom hours of video can easily be acquired, we contend that an identity-based approach is the most sensible and robust.
We describe an integrated facial, gestural, and vocal model that captures an individual's distinctive speaking mannerisms. Once trained on several hours of authentic video, the model can be used to distinguish one person from impersonators and deep-fake imposters. Given the recent attacks on President Zelenskyy, and the associated risks in the fog of war, we describe an in-depth analysis of the efficacy of this model in protecting President Zelenskyy against deep-fake attacks. Our approach, however, can be deployed to protect any world leader or other high-profile public figure.

[Table 1. Classification accuracy across real and fake (decoy and deep-fake) video segments of Zelenskyy. The columns correspond to the accuracy on different datasets at three true positive rates of 95.0% (red), 99.0% (green), and 99.9% (blue), i.e., the rate at which a real video segment is correctly classified.]
With only a single feature set, the classifier struggles to consistently distinguish Zelenskyy from the decoys and deep fakes. The pair-wise combinations of features (rows 4-6) provide a boost in accuracy, and the triple combination (row 7) provides a significant boost, yielding perfect accuracy across all datasets. Note that the vocal features provide no benefit for the FaceForensics++ dataset because these videos contain no audio track.
To determine how many of the 780 behavioral features are needed to achieve the classification accuracy reported in Table 1, we trained a series of classifiers on randomly selected subsets of features. Shown in Fig. 1 is the median accuracy of classifying the identities in the World Leaders and both Deep-Fake Zelenskyy datasets for a behavioral model with between 10 and 600 features and for three different true positive rates. Accuracy grows relatively slowly between 10 and 400 features, plateauing at 99.5% (for a 99.9% true positive rate), increasing to 99.99% at 600 features, and topping out at 100.0% with all 780 features. This suggests that a significant fraction of the facial, gestural, and vocal features is informative.
We next trained 4,000 classifiers on random feature subsets of size 10. The discriminatory power of each feature is computed as the weighted average of the accuracies of every classifier to which that feature contributed. Classifier accuracy ranges between 1.2% and 18.1%, with a median accuracy of 2.4%. Among the five most and five least discriminative features, each given as a correlation and its associated accuracy (where MFCC-n corresponds to a vocal characteristic), note first that features from the facial, gestural, and vocal sets are each represented in the top five most distinctive features. The most distinctive feature corresponds to the particular way that President Zelenskyy tends to gesture with his left arm while leaving his right arm dangling at his side, yielding a strong correlation between the movement of his right elbow and right shoulder as he moves side to side. The AU12-AU15 feature corresponds to a slight but consistent asymmetry in President Zelenskyy's smile, where the right side of his mouth turns slightly upward while the left side turns slightly downward. These highly specific correlations suggest that a deep-fake imposter will have considerable difficulty precisely capturing a person's behavioral mannerisms.
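To make this subset analysis concrete, the following is a minimal sketch; the matrices X_real and X_fake are hypothetical names for precomputed behavioral feature sets, and a simple average stands in for the weighted average described above.

```python
# Sketch: score each of the 780 features by the average accuracy of the
# random 10-feature one-class SVMs in which it participated.
import numpy as np
from sklearn.svm import OneClassSVM

def feature_power(X_real, X_fake, n_trials=4000, subset_size=10, seed=0):
    # X_real: (n_segments, 780) features from authentic video (assumed input).
    # X_fake: decoy/deep-fake segments with the same layout (assumed input).
    rng = np.random.default_rng(seed)
    n_features = X_real.shape[1]
    acc_sum = np.zeros(n_features)
    cnt = np.zeros(n_features)
    for _ in range(n_trials):
        idx = rng.choice(n_features, size=subset_size, replace=False)
        clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.01).fit(X_real[:, idx])
        # Accuracy proxy: fraction of fake segments flagged as outliers (-1).
        acc = np.mean(clf.predict(X_fake[:, idx]) == -1)
        acc_sum[idx] += acc
        cnt[idx] += 1
    return acc_sum / np.maximum(cnt, 1)  # per-feature discriminative power
```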

Discussion
Any approach to detecting manipulated media is vulnerable to the inherent cat-and-mouse game between creator and detector. In our case, however, we benefit from a three-pronged approach that analyzes facial, gestural, and vocal patterns and, importantly, the interplay between them. Synthesis engines are unaware of these semantic patterns, and therefore a direct counterattack that models these patterns is (currently) unlikely. We also benefit from a relatively long analysis time window (10 s). By comparison, synthesis engines (currently) generate only one video frame, or a few, at a time and are therefore unaware of these longer temporal patterns. Despite this seeming upper hand, we do not expect to publicly release our classifier, so as to slow counterattacks. We will, however, make our classifier available to reputable news and government agencies working to counter deep-fake-fueled disinformation campaigns.
Although we have focused our analysis on one world leader, our approach is applicable to any world leader or high-profile individual for whom sufficient authentic video is available. Our experience has been that at least 8 h of training video is required. (By focusing on a few central repositories, these data can be efficiently scraped in a matter of hours.) Deep fakes continue their trajectory toward higher-quality fake video while the barriers to creation, in terms of data, computing power, and technical skill, continue to fall. This democratized access to powerful video-synthesis technology is sure to yield exciting and entertaining applications while simultaneously posing new threats to our ability to trust what we see and hear online.

Materials and Methods
A. Dataset. We downloaded 506 min of video of President Zelenskyy from YouTube and the official website of the office of the Ukrainian president in four different contexts: a) public address (91 min); b) press briefing (207 min); c) bunker (47 min); and d) armchair (161 min). Portions of each video with large camera motions (e.g., zoom, translation, cross-fade) were automatically detected and removed from the dataset.
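The text does not specify how camera motion was detected; one plausible sketch, assuming dense optical flow via OpenCV (our choice, not necessarily the authors' method), is:

```python
# Hypothetical sketch: flag frames with large global motion (zoom, translation,
# cross-fade) using the median dense optical-flow magnitude; a speaker's
# gestures move only a minority of pixels, so the median tracks camera motion.
import cv2
import numpy as np

def large_motion_frames(path, threshold=2.0):
    cap = cv2.VideoCapture(path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    flagged, i = [], 1
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        if np.median(np.linalg.norm(flow, axis=2)) > threshold:
            flagged.append(i)
        prev_gray, i = gray, i + 1
    cap.release()
    return flagged
```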
A total of 57 min of interview-style videos of seven world leaders (Jacinda Ardern, Joe Biden, Kamala Harris, Boris Johnson, Wladimir Klitschko, Angela Merkel, and Vladimir Putin) were used as decoys (i.e., not Zelenskyy). Our deep-fake detection is designed to distinguish Zelenskyy's behavioral and gestural mannerisms from the imposters driving the creation of a deep fake, so these decoy videos, regardless of the identities, serve as proxies for deep fakes.

C. Gestural Mannerisms. For each video segment, the arm and hand positions and movements are estimated on a per-frame basis using BlazePose (12) from the MediaPipe library (13). Because we are interested only in the upper body, we consider the image x-, y-coordinates corresponding to the shoulder, elbow, and wrist of both arms, yielding a total of 12 individual measurements per frame. These upper-body coordinates, initially specified relative to the video-frame size, are normalized into a speaker-centric action plane (14). This action plane is a rectangular bounding box centered on the speaker's chest with a width 8× and height 6× the measured head height (15). This normalization ensures that the tracked upper-body coordinates can be compared across different speaker locations and sizes (see the first sketch below).

D. Vocal Mannerisms. Vocal mannerisms are captured as mel-frequency cepstral coefficients (MFCCs). To compute these, a periodogram is first estimated (using a Hann window), followed by an application of a mel-spaced filter bank and normalization by the respective filter bandwidths. Last, the MFCCs are computed using a discrete cosine transform (DCT-III) on the log-scaled weights, yielding an eight-dimensional signal per video segment (see the second sketch below).

E. Classification. Trained on authentic video of a person of interest, a novelty-detection model in the form of a one-class, nonlinear support vector machine (SVM) (17) is used to distinguish an individual from impersonators and deep fakes. The 123 min of decoy identities in the World Leaders, FaceForensics++, and Deep-Fake Zelenskyy videos are partitioned into overlapping (by 1/6 s) 10-s video segments, yielding a total of 25,267 segments. The 506 min of Zelenskyy video is similarly partitioned, yielding a total of 157,746 segments. The authentic segments are then randomly partitioned into an 80/10/10 training/validation/testing split. The SVM hyperparameters, consisting of the Gaussian kernel width (γ) and outlier percentage (ν), are optimized by performing a grid search over these parameters across the training set. The validation set is used to determine the SVM classification threshold that yields a specified true positive rate (correctly classifying a video segment as Zelenskyy). The classifier is then evaluated against the testing set. This entire process is repeated 10 times with randomized data splits, from which we report the average classification accuracy in Table 1 and Fig. 1.
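As a concrete illustration of the gestural normalization, the following is a minimal sketch built on MediaPipe Pose (BlazePose). The head-height and chest-center estimates, and the function name, are our assumptions rather than the paper's exact definitions.

```python
# Sketch: map upper-body landmarks into the speaker-centric action plane,
# a chest-centered box 8 head-heights wide and 6 head-heights tall.
import mediapipe as mp
import numpy as np

_pose = mp.solutions.pose.Pose(static_image_mode=True)
UPPER_BODY = [11, 12, 13, 14, 15, 16]  # shoulders, elbows, wrists (MediaPipe ids)

def normalized_upper_body(frame_rgb):
    """Return 12 action-plane coordinates (x, y for six joints), or None."""
    res = _pose.process(frame_rgb)
    if res.pose_landmarks is None:
        return None
    lm = np.array([(p.x, p.y) for p in res.pose_landmarks.landmark])
    shoulder_mid = (lm[11] + lm[12]) / 2.0
    # Assumed head height: twice the nose (id 0) to shoulder-midline distance.
    head_h = 2.0 * abs(lm[0, 1] - shoulder_mid[1])
    chest = shoulder_mid + np.array([0.0, 0.5 * head_h])  # assumed chest center
    box = np.array([8.0 * head_h, 6.0 * head_h])          # action-plane extent
    return ((lm[UPPER_BODY] - chest) / box).flatten()
```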
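The MFCC pipeline can be sketched directly from the steps above. The per-frame input length and the 26-filter bank are assumptions; only the Hann window, bandwidth normalization, DCT-III, and eight output coefficients come from the text.

```python
# Sketch: Hann-windowed periodogram -> mel filter bank (bandwidth-normalized)
# -> DCT-III of log-scaled filter outputs -> eight MFCCs per audio frame.
import numpy as np
from scipy.fft import dct
from scipy.signal import get_window

def mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(x, sr, n_filters=26, n_coeffs=8):
    w = get_window("hann", len(x))
    spec = np.abs(np.fft.rfft(x * w)) ** 2 / len(x)       # periodogram
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    edges = mel_inv(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    feats = np.empty(n_filters)
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        tri = np.minimum(np.clip((freqs - lo) / (mid - lo), 0.0, 1.0),
                         np.clip((hi - freqs) / (hi - mid), 0.0, 1.0))
        feats[i] = np.dot(tri, spec) / (hi - lo)          # bandwidth-normalized
    return dct(np.log(feats + 1e-12), type=3)[:n_coeffs]  # DCT-III, 8 coeffs
```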
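Finally, a minimal sketch of the novelty-detection classifier: a one-class RBF SVM fit to authentic segments, with the decision threshold chosen on the validation split to hit a target true positive rate. The grid values and the model-selection score are placeholders; the text specifies only that γ and ν are grid-searched.

```python
# Sketch: one-class SVM over behavioral features; threshold set for a target
# true-positive rate (fraction of real segments classified as Zelenskyy).
import numpy as np
from sklearn.svm import OneClassSVM

def fit_identity_model(X_train, X_val, tpr=0.999,
                       gammas=(1e-3, 1e-2, 1e-1), nus=(0.01, 0.05, 0.1)):
    best, best_score = None, -np.inf
    for g in gammas:
        for n in nus:
            clf = OneClassSVM(kernel="rbf", gamma=g, nu=n).fit(X_train)
            score = clf.score_samples(X_train).mean()  # placeholder criterion
            if score > best_score:
                best, best_score = clf, score
    # Choose the threshold so that `tpr` of real validation segments pass.
    threshold = np.quantile(best.decision_function(X_val), 1.0 - tpr)
    return best, threshold

def is_authentic(clf, threshold, X):
    return clf.decision_function(X) >= threshold  # True = consistent identity
```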
Data, Materials, and Software Availability. Some study data are available. The data associated with this manuscript include the training videos, which we will make available. We prefer not to release the trained behavioral models because we fear that they could be used by an adversary to evaluate the realism of fake videos. We will, however, make our model available upon request to researchers working in the general space of digital forensics (18).